
[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) #22051

Merged
Fridge003 merged 1 commit into sgl-project:main from froststeam:qzg/musa-fa-fix on Apr 10, 2026

Conversation

@froststeam (Contributor) commented Apr 3, 2026

Motivation

This PR re-introduces the Flash Attention backend support that was merged in PR #17985 and later reverted in PR #22002 due to a bug: the original commit 2373552 caused CI failures (see the failed CI job).

Previously, the MUSA-adapted flash attention implementation had a bug in the _forward_extend_impl method: it lacked a mechanism to select the kernel implementation based on the fa_impl_ver parameter, so it always used the default FA3 implementation regardless of the specified version.

Fix Applied

After rebasing onto the latest main branch, the kernel selection logic has been refactored and moved into the FlashAttentionBackend.__init__ method, ensuring that the appropriate flash attention implementation is selected during initialization based on the fa_impl_ver parameter.

  1. Moved kernel selection to __init__: The logic to select the correct flash attention kernel (including MUSA-specific implementations) is now handled in the FlashAttentionBackend.__init__ method, where two instance variables are initialized:

    • self.flash_attn_with_kvcache: For cached attention operations
    • self.flash_attn_varlen_func: For variable-length attention operations
  2. Updated forward methods: Both _forward_extend_impl and _forward_decode_impl now use these instance variables instead of directly calling the default implementations, ensuring the correct kernel is used based on the initialized configuration (a minimal sketch of this pattern follows the list).
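For illustration, here is a minimal, self-contained sketch of the dispatch pattern described above. All names below (the stand-in kernel functions and the is_musa flag) are illustrative placeholders, not the actual sglang signatures:

```python
# Hypothetical sketch: select the flash attention kernel once in __init__
# and dispatch through instance attributes afterwards.

def fa3_varlen(q, k, v):
    # Stand-in for the default FA3 varlen kernel.
    return "fa3"

def mate_varlen(q, k, v):
    # Stand-in for the MUSA/MATE varlen kernel.
    return "mate"

class FlashAttentionBackend:
    def __init__(self, fa_impl_ver: int = 3, is_musa: bool = False):
        # Kernel selection happens exactly once, at initialization time,
        # based on the platform and the requested fa_impl_ver.
        if is_musa:
            self.flash_attn_varlen_func = mate_varlen
        elif fa_impl_ver == 3:
            self.flash_attn_varlen_func = fa3_varlen
        else:
            raise NotImplementedError(f"unsupported fa_impl_ver={fa_impl_ver}")

    def _forward_extend_impl(self, q, k, v):
        # The original bug: calling the FA3 kernel directly here ignored
        # fa_impl_ver. Dispatching through the instance attribute respects
        # whatever __init__ selected.
        return self.flash_attn_varlen_func(q, k, v)

backend = FlashAttentionBackend(is_musa=True)
assert backend._forward_extend_impl(None, None, None) == "mate"
```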

Accuracy Tests

root@324a10004:/sgl-workspace/sglang# python3 -m sglang.launch_server \
            --model-path /mnt/seed17/001688/models/Qwen2.5-7B-Instruct/ \
            --served-model-name base-model \
            --trust-remote-code \
            --mem-fraction-static 0.80 \
            --cuda-graph-bs $(seq 1 2) \
            --host 0.0.0.0 \
            --port 30000 \
            --attention-backend fa3 \
            --tp-size 2 \
            --pp-size 2 \
            --disable-radix-cache \
            --chunked-prefill-size -1
2026-04-09 20:05:05 | warnings | 140537684047680 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(

2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : Platform plugin musa is activated
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:06 | _custom_ops | 140537684047680 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:06 | warnings | 140537684047680 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:06 | warnings | 140537684047680 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:07 | server_args | 140537684047680 | WARNING : Pipeline parallelism is incompatible with overlap schedule.
[2026-04-09 20:05:07] server_args=ServerArgs(model_path='/mnt/seed17/001688/models/Qwen2.5-7B-Instruct/', tokenizer_path='/mnt/seed17/001688/models/Qwen2.5-7B-Instruct/', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, enable_http2=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.8, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=-1, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=64, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='musa', tp_size=2, pp_size=2, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_response_default_include_usage=False, incremental_streaming_output=False, enable_streaming_session=False, random_seed=400815404, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_mfu_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='base-model', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, 
load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, experts_shared_outer_loras=None, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_dflash_block_size=None, speculative_dflash_draft_window_size=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_max_trie_depth=18, speculative_ngram_capacity=10000000, speculative_ngram_external_corpus_path=None, speculative_ngram_external_sam_budget=0, speculative_ngram_external_corpus_max_tokens=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enforce_disable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_hisparse=False, hisparse_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, 
offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=2, cuda_graph_bs=[1, 2], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=-1, piecewise_cuda_graph_tokens=[], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, gc_threshold=None, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_prefill_context_parallel=False, prefill_cp_mode='in-seq-split', enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, engine_info_bootstrap_port=6789, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, 
decrypted_draft_config_file=None, forward_hooks=None)
[2026-04-09 20:05:08] Using default HuggingFace chat template with detected content format: string
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : Platform plugin musa is activated
2026-04-09 20:05:15 | __init__ | 140102443353920 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140102443353920 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | __init__ | 140314952234816 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140314952234816 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 140102443353920 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | warnings | 140314952234816 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | __init__ | 139839021729600 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 139839021729600 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 139839021729600 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | __init__ | 140605230712640 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140605230712640 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 140605230712640 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | __init__ | 140715677046592 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140715677046592 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 140715677046592 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | warnings | 140102443353920 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 140314952234816 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 139839021729600 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 140605230712640 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 140715677046592 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

[2026-04-09 20:05:15 PP1 TP0] Process 1187068 gpu_id 2 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
[2026-04-09 20:05:15 PP1 TP1] Process 1187069 gpu_id 3 is running on CPUs: [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127]
[2026-04-09 20:05:15 PP0 TP1] Process 1187067 gpu_id 1 is running on CPUs: [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127]
[2026-04-09 20:05:16 PP0 TP0] Process 1187066 gpu_id 0 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
[2026-04-09 20:05:16 PP1 TP1] Init torch distributed begin.
[2026-04-09 20:05:16 PP1 TP0] Init torch distributed begin.
[2026-04-09 20:05:16 PP0 TP1] Init torch distributed begin.
[2026-04-09 20:05:16 PP0 TP0] Init torch distributed begin.
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-09 20:05:18 PP1 TP0] sglang is using nccl==2.11.4
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-09 20:05:18 PP0 TP0] sglang is using nccl==2.11.4
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-09 20:05:20 PP0 TP1] sglang is using nccl==2.11.4
[2026-04-09 20:05:20 PP0 TP0] sglang is using nccl==2.11.4
[2026-04-09 20:05:20 PP0 TP0] Init torch distributed ends. elapsed=3.35 s, mem usage=0.89 GB
[2026-04-09 20:05:20 PP1 TP1] Init torch distributed ends. elapsed=3.69 s, mem usage=0.97 GB
[2026-04-09 20:05:20 PP0 TP1] Init torch distributed ends. elapsed=3.64 s, mem usage=0.97 GB
[2026-04-09 20:05:20 PP1 TP0] Init torch distributed ends. elapsed=3.64 s, mem usage=0.97 GB
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP1 TP0] Load weight begin. avail mem=78.37 GB
[2026-04-09 20:05:20 PP1 TP1] Load weight begin. avail mem=78.37 GB
[2026-04-09 20:05:20 PP0 TP0] Load weight begin. avail mem=78.26 GB
[2026-04-09 20:05:20 PP0 TP1] Load weight begin. avail mem=78.37 GB
Multi-thread loading shards:  50% Completed | 2/4 [00:06<00:07,  3.54s/it]
[2026-04-09 20:05:30 PP0 TP1] Parameter lm_head.weight not found in params_dict
[2026-04-09 20:05:30 PP0 TP1] Parameter model.norm.weight not found in params_dict
[2026-04-09 20:05:30 PP0 TP0] Parameter lm_head.weight not found in params_dict
[2026-04-09 20:05:30 PP0 TP0] Parameter model.norm.weight not found in params_dict
Multi-thread loading shards: 100% Completed | 4/4 [00:12<00:00,  3.05s/it]
[2026-04-09 20:05:36 PP0 TP1] Load weight end. elapsed=15.82 s, type=Qwen2ForCausalLM, avail mem=74.52 GB, mem usage=3.86 GB.
[2026-04-09 20:05:36 PP0 TP0] Load weight end. elapsed=15.83 s, type=Qwen2ForCausalLM, avail mem=74.40 GB, mem usage=3.86 GB.
[2026-04-09 20:05:36 PP0 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-09 20:05:37 PP1 TP1] Parameter model.embed_tokens.weight not found in params_dict
[2026-04-09 20:05:37 PP1 TP0] Parameter model.embed_tokens.weight not found in params_dict
[2026-04-09 20:05:37 PP1 TP1] Load weight end. elapsed=16.38 s, type=Qwen2ForCausalLM, avail mem=74.52 GB, mem usage=3.86 GB.
[2026-04-09 20:05:37 PP1 TP0] Load weight end. elapsed=16.38 s, type=Qwen2ForCausalLM, avail mem=74.52 GB, mem usage=3.86 GB.
[2026-04-09 20:05:37 PP1 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-09 20:05:37 PP0 TP0] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP1 TP0] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP0 TP0] Memory pool end. avail mem=14.99 GB
[2026-04-09 20:05:37 PP1 TP0] Memory pool end. avail mem=15.11 GB
[2026-04-09 20:05:37 PP0 TP1] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP1 TP1] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP0 TP1] Memory pool end. avail mem=15.11 GB
[2026-04-09 20:05:37 PP1 TP1] Memory pool end. avail mem=15.11 GB
[2026-04-09 20:05:38 PP0 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=15.05 GB
[2026-04-09 20:05:38 PP1 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=15.05 GB
[2026-04-09 20:05:38 PP1 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=15.05 GB
[2026-04-09 20:05:38 PP1 TP0] Capture cuda graph bs [1, 2]
[2026-04-09 20:05:38 PP0 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=14.94 GB
[2026-04-09 20:05:38 PP0 TP0] Capture cuda graph bs [1, 2]
Capturing batches (bs=1 avail_mem=14.32 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:02<00:00, 31.37s/it]
[2026-04-09 20:06:41 PP1 TP0] Registering 56 cuda graph addresses
[2026-04-09 20:06:42 PP1 TP1] Capture cuda graph end. Time elapsed: 63.94 s. mem usage=0.74 GB. avail mem=14.32 GB.
[2026-04-09 20:06:42 PP1 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-09 20:06:42 PP1 TP0] Capture cuda graph end. Time elapsed: 63.95 s. mem usage=0.74 GB. avail mem=14.32 GB.
[2026-04-09 20:06:42 PP1 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
Capturing batches (bs=1 avail_mem=14.21 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:04<00:00, 32.34s/it]
[2026-04-09 20:06:43 PP0 TP0] Registering 58 cuda graph addresses
[2026-04-09 20:06:44 PP0 TP1] Capture cuda graph end. Time elapsed: 66.49 s. mem usage=0.74 GB. avail mem=14.32 GB.
[2026-04-09 20:06:44 PP0 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-09 20:06:44 PP0 TP0] Capture cuda graph end. Time elapsed: 66.50 s. mem usage=0.74 GB. avail mem=14.20 GB.
[2026-04-09 20:06:44 PP0 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-09 20:06:45 PP0 TP0] max_total_num_tokens=4400192, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=14.20 GB
[2026-04-09 20:06:45 PP1 TP0] max_total_num_tokens=4400192, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=14.32 GB
[2026-04-09 20:06:45] INFO:     Started server process [1185766]
[2026-04-09 20:06:45] INFO:     Waiting for application startup.
[2026-04-09 20:06:45] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2026-04-09 20:06:45] INFO:     Application startup complete.
[2026-04-09 20:06:45] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2026-04-09 20:06:46] INFO:     127.0.0.1:35942 - "GET /model_info HTTP/1.1" 200 OK
[2026-04-09 20:06:47 PP0 TP1] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP0 TP0] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP1 TP1] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP1 TP0] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-09 20:06:48 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-09 20:06:49] INFO:     127.0.0.1:35956 - "POST /generate HTTP/1.1" 200 OK
[2026-04-09 20:06:49] The server is fired up and ready to roll!
[2026-04-09 20:06:59 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 5.84
[2026-04-09 20:06:59 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 6.08
[2026-04-09 20:06:59 PP0 TP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.39, #queue-req: 0
[2026-04-09 20:06:59 PP1 TP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.39, #queue-req: 0
[2026-04-09 20:07:01 PP0 TP0] Decode batch, #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 24.78, #queue-req: 0
[2026-04-09 20:07:01 PP1 TP0] Decode batch, #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 24.77, #queue-req: 0
[2026-04-09 20:07:01 PP0 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 59.19, #queue-req: 0
[2026-04-09 20:07:01 PP1 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 59.23, #queue-req: 0
[2026-04-09 20:07:02 PP0 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.99, #queue-req: 0
[2026-04-09 20:07:02 PP1 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 59.03, #queue-req: 0
[2026-04-09 20:07:03 PP0 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.76, #queue-req: 0
[2026-04-09 20:07:03 PP1 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.76, #queue-req: 0
[2026-04-09 20:07:04 PP0 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.73, #queue-req: 0
[2026-04-09 20:07:04 PP1 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.73, #queue-req: 0
[2026-04-09 20:07:04 PP0 TP0] Decode batch, #running-req: 1, #token: 320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.69, #queue-req: 0
[2026-04-09 20:07:04 PP1 TP0] Decode batch, #running-req: 1, #token: 320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.69, #queue-req: 0
[2026-04-09 20:07:05] INFO:     127.0.0.1:56346 - "POST /generate HTTP/1.1" 200 OK

python3 /sglang/python/sglang/test/few_shot_gsm8k.py
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:54<00:00,  1.75it/s]
Accuracy: 0.870
Invalid: 0.000
Latency: 114.105 s
Output throughput: 322.844 token/s

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process.
  2. Get approvals from the code owners (https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and other reviewers.
  3. Trigger CI tests (see https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests) or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Related Links:

@github-actions bot added the dependencies label (Pull requests that update a dependency file) on Apr 3, 2026
@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the MUSA (Moore Threads GPU) hardware backend, specifically focusing on Flash Attention integration. It adds necessary dependencies, configuration parameters, and a new MUSA-specific attention module that wraps the mate library's flash attention functions. The implementation uses a thread-local context manager to automatically inject scheduler metadata into attention calls. Key changes include updates to the attention registry, the FlashAttentionBackend to handle MUSA-specific logic, and server argument adjustments for MUSA compatibility. Feedback highlights potential issues with global buffer safety in multi-GPU environments, metadata cache collisions due to non-unique keys, and the implications of ignoring cu_seqlens_k_new in the MUSA implementation.
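For context, a minimal, self-contained sketch of the thread-local injection pattern the review refers to is shown below; the names here (forward_metadata, flash_attn_call) are hypothetical, not the PR's actual API:

```python
import threading
from contextlib import contextmanager

_tls = threading.local()

@contextmanager
def forward_metadata(metadata):
    # Expose scheduler metadata to attention calls on the current thread only,
    # restoring any previously active metadata on exit.
    prev = getattr(_tls, "metadata", None)
    _tls.metadata = metadata
    try:
        yield
    finally:
        _tls.metadata = prev

def flash_attn_call(q, k, v):
    # The attention wrapper reads the injected metadata instead of requiring
    # every call site to pass it explicitly.
    md = getattr(_tls, "metadata", None)
    return {"out": None, "metadata": md}

with forward_metadata({"cu_seqlens_q": [0, 4, 9]}):
    result = flash_attn_call(None, None, None)
    assert result["metadata"]["cu_seqlens_q"] == [0, 4, 9]
```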

@froststeam changed the title from "[MUSA][9/N] Re-introduceFA3 attention backend support through MATE (MUSA AI Tensor Engine)" to "[MUSA][9/N] Re-introduce FA3 attention backend support through MATE" on Apr 3, 2026
@yeahdongcn (Collaborator) left a comment


I think it would be better to split this into two commits: one carrying over changes from the previous PR, and another fixing the regression in selecting FA kernels for different NVIDIA GPU architectures. This should make it easier for the SGLang core team to review.

@yeahdongcn requested a review from Kangyan-Zhou on April 5, 2026 13:20
@froststeam changed the title from "[MUSA][9/N] Re-introduce FA3 attention backend support through MATE" to "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" on Apr 6, 2026
@froststeam force-pushed the qzg/musa-fa-fix branch 3 times, most recently from 9cb257c to 0af5fe5, on April 6, 2026 12:58
@yeahdongcn (Collaborator)

/tag-and-rerun-ci

@yeahdongcn (Collaborator)

/rerun-failed-ci

4 similar comments

@froststeam (Contributor, Author)

/rerun-failed-ci

@froststeam (Contributor, Author)

/rerun-failed-ci

@yeahdongcn (Collaborator)

/rerun-failed-ci

3 similar comments

@yeahdongcn (Collaborator)

Hi @Fridge003 and @Kangyan-Zhou, all NVIDIA CI checks have passed. Could you please take a look if we can merge this? Thanks!

@Fridge003 merged commit f7a1740 into sgl-project:main on Apr 10, 2026
293 of 342 checks passed
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
…ensor Engine) (#22051)

Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…ensor Engine) (sgl-project#22051)

Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>

Labels

dependencies (Pull requests that update a dependency file), jit-kernel, mthreads, run-ci
